Abstract. Conversations allow the quick transfer of short bits of information and it is reasonable to expect 
that changes in communication medium affect how we converse. Using conversations in works of fiction 
and in an online social networking platform, we show that the utterance length of conversations is slowly 
shortening with time but adapts more strongly to the constraints of the communication medium. This 
indicates that the introduction of any new medium of communication can affect the way natural language 
evolves. 
1 Introduction 

Drawing on an estimated vocabulary of 20,000 to 40,000
base words [1, 2, 3], conversations quickly transfer short
bits of information via two general means: the oral and
the written form. Although the written vocabulary is often
larger [?], the grammatically looser and more error-prone
oral medium has the advantage of access to nonverbal
cues like gestures and intonation [5] to aid communication.
Aside from vocabulary size, word choice, unconscious
repetition of words, and other idiosyncrasies [6] also
affect the way we perceive conversations.

Conversation analysis typically looks into how turn-taking
patterns in institutional settings depart from those
observed in informal conversations [7], or into the
psychological or sociological aspects [5] of social structure.
In this work, we derived the length distribution of a single
speaking turn, or utterance, to determine whether the medium
affects the way we express ideas. The datasets mix
real-world (online) and fictional (offline) conversations:
online conversations in Twitter (twitter.com); conversations
from 19th century novels and short stories; and
subtitles from 20th century movies.

Humans typically converse orally, so conversations are
usually analyzed by transcribing recorded audio into text.
When this is not possible, e.g., before the invention of
recorded audio, one technique is to use written records of
real and constructed conversations, as was done in studies
on the emergence of complementary clauses (Paul persuaded
John to kiss Mary) [9], the use of do in negative
declaratives (I do not understand you) [10], and the
increasing prevalence of the modals gonna, gotta and wanna
[11]. Written records of spoken speech are also included in
corpora like A Corpus of English Dialogues 1560-1760 [12]
and The Corpus of Historical American English: 400 million
words, 1810-2009 [13]. However, only conversations
(fictional dialogues) in novels, short stories, and movies
were analyzed in this paper because their utterances tend
to be less narrative and are directed to another person,
unlike those in other genres such as drama comedies or
trial transcripts. Although it has been shown that styles
vary across and even within authors [14], we assumed that
conversations in their works are mostly independent of the
author's style, i.e., a conversation in a work conveys how
a character, and not the author, speaks. Furthermore,
transcription errors are practically eliminated when using
books and movies.

Twitter, as a form of computer-mediated communication,
is different from oral or written media [15]. While
assumed to be happening in real time, the purely written
nature of a Twitter-based conversation differentiates
it from the transcribed oral communication in books and
movies. In addition, Twitter conversations have an explicit
length limit: an utterance can only be up to 140 characters
long.

Imposing a length constraint from the outset can produce
drastic changes. A case in point is SMS messaging.
At its peak, textspeak looked very different from
standard spelling, primarily because of the effort it takes
to spell out words on a numerical keypad. Tweets, however,
were largely spared from this phenomenon and usually
have correct spelling. Among the three media analyzed
in this study, Twitter is the only one that is constrained.
Conversations in books and movies are supposedly oral
conversations that were written down in the form of a book
or a subtitle, so their written form should have no effect
on them.

We now argue that if conversations are independent of
medium, then no significant difference should be observed
among conversations in Twitter, books and movies. On
the other hand, if differences are due to an explicit quirk
in the medium, e.g., an utterance length limit, then
conversations in Twitter must be significantly different
from conversations in books and movies, but the latter
two should not be significantly different from each other.
Finally, if conversations are indeed dependent on medium,
then conversations in Twitter, movies, and books must all
be significantly different from each other.



2 Orthographic sentence length and the Brown corpus

The study of sentence lengths in text dates back to the
1939 paper of Udny Yule [16], where it was used to establish
authorship. More recently, sentence length has been used
to classify text genre by itself [17] or in combination with
other text properties [18]. Yule's 1939 paper did not
provide the sentence length distribution, but several decades
after its publication the distribution was described as
lognormal [19, 20, 21], a description that Sichel [22, 23]
later showed to be flawed. More recently, Sigurd et al. [24]
showed that sentence length distributions may be
approximated by a gamma distribution.

In this work, we measured utterance lengths in the
non-standard unit of number of characters (orthographic
length), instead of the usual sentence length units of
clauses or words, for ease of comparison with Twitter, whose
maximum utterance length is defined in characters.
Although the distributions of sentence lengths in words and
of word lengths in letters can each be described by a gamma
distribution [24], there is no mathematical guarantee that
the distribution of sentence lengths in letters follows the
same distribution in the general case where the two
distributions have different shape and scale parameters. If
it can be shown that the sentence length (in letters)
distribution can be approximated by a member of the same
distribution family as the sentence length (in words)
distribution, then comparing sentence lengths using
orthographic length is a valid approach.

The Brown corpus [25] consists of about one million
words of edited English prose printed during 1961 in the
United States [26]. To verify whether sentence lengths
measured in characters may be approximated by a gamma
distribution, the sentence length (in letters) distribution
was simulated as follows. The word length (in letters)
and sentence length (in words) distributions of the tagged
Brown corpus were first constructed using the Natural
Language Toolkit [27] Python module. In constructing the
distributions, only words that contain at least one letter
were considered. A gamma distribution given by

    \Pr(x) = \frac{x^{a-1} e^{-x/s}}{s^{a} \Gamma(a)},    (1)

where a and s are fitting parameters that describe the
shape and ordinate scaling factor, respectively, was then
fitted to each distribution using the maximum likelihood
estimation [28] feature of the SciPy Python module [29].
For each trial, 100,000 sentences were generated following
the fitted sentence length (in words) and word length
(in letters) distributions. This process was repeated for a
total of 100 trials, resulting in 100 histograms of sentence
length in letters. The histograms were converted to a single
probability distribution by using the median frequency for
each sentence length.
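The simulation of sentence lengths in letters can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses only the Python standard library, plugs in the gamma parameters reported for the Brown corpus (word length: a = 3.43, s = 1.39; sentence length: a = 2.09, s = 8.44) instead of refitting them, rounds each gamma draw to a positive integer (the discretization is an assumption), counts letters only, and runs far fewer trials than the 100 trials of 100,000 sentences used in the paper.

```python
import random
from collections import Counter
from statistics import median

# Gamma parameters (shape a, scale s) reported in the text for the Brown corpus.
WORD_LEN = (3.43, 1.39)   # word length in letters
SENT_LEN = (2.09, 8.44)   # sentence length in words


def draw_count(rng, shape, scale):
    """Draw a gamma variate and round it to a positive integer.

    Rounding is an assumption; the text does not say how draws were discretized.
    """
    return max(1, round(rng.gammavariate(shape, scale)))


def simulate_sentence_lengths(n_sentences, rng):
    """Return the length in letters of each simulated sentence."""
    lengths = []
    for _ in range(n_sentences):
        n_words = draw_count(rng, *SENT_LEN)
        lengths.append(sum(draw_count(rng, *WORD_LEN) for _ in range(n_words)))
    return lengths


def median_histogram(n_trials, n_sentences, seed=0):
    """Combine per-trial histograms by taking the median frequency per length."""
    rng = random.Random(seed)
    trials = [Counter(simulate_sentence_lengths(n_sentences, rng))
              for _ in range(n_trials)]
    all_lengths = sorted(set().union(*trials))
    return {x: median(t[x] for t in trials) for x in all_lengths}


# Small run for illustration only (the paper used 100 trials of 100,000 sentences).
hist = median_histogram(n_trials=5, n_sentences=2000)
```

The resulting dictionary maps each sentence length in letters to its median frequency across trials, which is the quantity that would then be fitted with a gamma distribution.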




Fig. 1. (a) Word length in letters and (b) sentence length
in words distributions of the Brown corpus superimposed with
the maximum likelihood estimate of Eq. (1) (solid line). (c)
Simulated sentence length in letters distribution (solid dots),
generated using the fitted word length (in letters) and
sentence length (in words) distributions of the Brown corpus,
superimposed with the least-squares fit (solid line) and
values within one standard deviation (shaded).



Both the word length (WL, in letters) [Fig. 1(a)] and
sentence length (SL, in words) [Fig. 1(c)] distributions of
the Brown corpus follow a gamma distribution (WL: a =
3.43, s = 1.39, r^2 = 0.948; SL: a = 2.09, s = 8.44, r^2 =
0.989). The simulated sentence length in letters
distribution [Fig. 1(b)] also follows a gamma distribution (a =
1.98, s = 51.2) but has a much larger s than the sentence
length in words distribution, which is expected since
letters are smaller units than words.

The sentence length distribution in letters thus belongs
to the same family of distributions as when measured in
words. Since utterance lengths are being compared
empirically, the use of orthographic length as a unit of
utterance length is therefore valid despite known
idiosyncrasies [30] of the English language. Interestingly,
Piantadosi et al. [31] also used orthographic length when
they showed that word lengths are optimized for efficient
communication, because it is easier to measure while still
being highly correlated with word length in terms of
syllables [32].
3 Datasets

Four datasets were used for our analysis: utterances in
fictional works in Project Gutenberg (PG) (gutenberg.org),
utterances in PG split into sentences (PGS), tweets from
Twitter (TWITTER), and utterances in movie subtitles (SUBS)
from opensubtitles.org.

PG was generated by extracting utterances, defined as
text enclosed in double quotes, from the available works
in Project Gutenberg of 50 authors whose selection was
roughly based on availability (see Ref. [33] for the list of
titles, and Ref. [33] for author selection and text parsing
details). The resulting dataset consists of about 2.3 million
utterances, with zero-length utterances (0.01% of the
original dataset) removed. The author with the largest number
of utterances (George Manville Fenn) has 238,640
utterances, while the author with the smallest number
(David Herbert Lawrence) has 1,170. The median number of
utterances per author is 36,955. When split into sentences,
PG is converted to PGS, which has about 4.2 million
utterances with a median of 69,311 utterances per author.
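The extraction rule described above (an utterance is text enclosed in double quotes, and PGS splits each utterance into sentences) might be sketched as follows. The regular expressions, the whitespace handling and the naive sentence splitter are illustrative assumptions; the authors' actual parsing details are in their cited reference and may differ.

```python
import re

# Straight double quotes only; curly or nested quotes would need extra handling.
UTTERANCE_RE = re.compile(r'"([^"]*)"')
# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
SENTENCE_END_RE = re.compile(r'(?<=[.!?])\s+')


def extract_utterances(text):
    """Return non-empty quoted spans with internal whitespace collapsed."""
    spans = (" ".join(m.split()) for m in UTTERANCE_RE.findall(text))
    return [s for s in spans if s]   # drop zero-length utterances


def split_into_sentences(utterance):
    """Split one utterance into sentences (PG -> PGS)."""
    return [s for s in SENTENCE_END_RE.split(utterance) if s]


sample = 'He turned. "Watch out! The bridge is down," he said. "" She nodded. "I know."'
pg = extract_utterances(sample)
pgs = [s for u in pg for s in split_into_sentences(u)]
lengths = [len(u) for u in pg]       # orthographic lengths in characters
```

Here `len()` counts every character inside the quotes, including spaces and punctuation; whether the original analysis counted them the same way is not stated in the text.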

Conversations in TWITTER were identified by looking
for replies, which are Twitter messages (or tweets) directed
to specific users. We used the convention that replies
begin with the @username of the receiver, e.g., "@bob Hello!
How are you?", to filter the tweets for our dataset.^1 Though
not in the original design, the use of replies emerged as
the leading method of addressing a particular person in
Twitter [35]. The presence of an @username anywhere in
the tweet makes that tweet a mention [36]. Unlike mentions,
which appear in the timeline of a user following the
sender, a reply appears in said user's timeline only if they
follow both the sender and the receiver of the reply message.
Thus, conversations are most likely restricted to replies to
avoid flooding the timeline of people not involved in the
discussion. Though mentions may carry conversations, we
still excluded them from the dataset, as they are more likely
to be non-conversational tweets.
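The reply convention can be illustrated with a short filter. This is a sketch under assumptions: the patterns below treat any run of word characters after "@" as a username, and the helper names are hypothetical; the text does not say whether the leading @username was counted in the utterance length, so `reply_body` is offered only as one option.

```python
import re

REPLY_RE = re.compile(r'^@\w+')    # tweet begins with @username
MENTION_RE = re.compile(r'@\w+')   # @username anywhere in the tweet


def is_reply(tweet):
    """True if the tweet starts with an @username (the convention in the text)."""
    return REPLY_RE.match(tweet) is not None


def is_mention_only(tweet):
    """True if the tweet contains an @username but does not start with one."""
    return MENTION_RE.search(tweet) is not None and not is_reply(tweet)


def reply_body(tweet):
    """Text of a reply with the leading @username removed."""
    return REPLY_RE.sub('', tweet, count=1).strip()


tweets = ['@bob Hello! How are you?', 'Lunch with @alice today', 'No addressee here']
replies = [t for t in tweets if is_reply(t)]
```

Only the first toy tweet is kept as a reply; the second is a mention and the third addresses no one, so both would be excluded from the dataset.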

It is possible that a reply is not reciprocated, e.g., if it 
was meant to bring an item, such as a URL, to the atten- 
tion of another user. This is still considered a conversation 
because it conveys a short bit of information directly tar- 
geted to a certain user. This is similar to someone telling 
another to "watch out!" or "be careful": a reply by the 
other person is not required. 

Using the Twitter Streaming application programming
interface (API) [33], five one-week samples of public tweets
from September 2009 to July 2010 were selected. From
the one-week samples, composed of around 16.2 million
to 57.6 million tweets representing about 15% of public
tweets [37], nonzero-length messages were extracted,
which yielded about 52 million messages or utterances (see
Ref. [33] for datasets and parsing details). For better
comparison with PG and PGS, which have 50 subsets (authors)
each, the weekly datasets were subdivided into ten groups
of shuffled hourly data.

^1 The current Twitter API supports a method for explicitly
classifying a tweet as a reply but this was not yet widely
available and followed when our data were gathered.

SUBS consists of about 14.7 million utterances from
15,809 movies provided by opensubtitles.org. The movie
release years span from 1896 to 2010. See Ref. [33] for
parsing details and Ref. [38] for the complete list of movies.

4 Utterance length distributions of datasets 

Twitter conversations [Fig. 2(a)] have an asymmetric and
bimodal utterance length distribution. The left peak (mode)
is at 16 characters and belongs to what we take to be the
natural distribution of message lengths, i.e., the
distribution of an unrestricted conversation. Similar to the
argument used by Sigurd et al. [23] in their study of word
and sentence length distributions of English, Swedish and
German texts, and by Cancho and Sole [39] in their work on
the origin of Zipf's law, we posit that the length of an
utterance in a conversation is also governed by a trade-off
between packing as much information as possible in an
utterance and expressing the utterance as quickly as
possible: the first objective is biased towards increasing
length (~ x^{a-1}) while the other is biased towards
decreasing it (~ e^{-x}). Combining the two objectives, the
following distribution is obtained: ~ x^{a-1} e^{-x}.
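As a brief worked step on how the two competing objectives balance, one can locate the peak of the combined form by setting its derivative to zero. This is a sketch assuming a > 1, so that an interior maximum exists:

```latex
\frac{d}{dx}\left[x^{a-1}e^{-x}\right]
  = x^{a-2}e^{-x}\left[(a-1) - x\right] = 0
  \quad\Longrightarrow\quad x_{\mathrm{mode}} = a - 1 .
```

A larger shape parameter a therefore moves the peak towards longer utterances, while the exponential factor governs how quickly the tail decays.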




Fig. 2. (a) Message length (in characters) distribution of
sampled tweets with the curve fit having the highest r^2
value (a = 1.37, solid line). Error bars are standard
deviations from five one-week samples. (b) The a values
(filled squares) of the fit of Eq. (2) up to the cut-off
length x_c and the corresponding r^2 values (unfilled
triangles).

To account for the strict length limit of Twitter
messages, the natural utterance length distribution was
estimated by fitting a more general equation, using a
modified Levenberg-Marquardt least squares algorithm [33],
to the utterance length distribution from its lower end up
to a cut-off length x_c in [16, 140] [Fig. 2(b)],

    \Pr(x) \propto \tilde{x}^{\,a-1} e^{-\tilde{x}},    (2)

where \tilde{x} = (x - x_0)/s is the scaled utterance length,
while a, x_0 and s are fitting parameters that describe the
shape, translation and ordinate scaling factor, respectively.
This method of estimation assumes that the mixing parameter
of the bimodal distribution is almost one in favor of
the natural utterance length distribution. A bimodal
distribution fitted using expectation maximization was not
utilized because of the lack of an explicit model of the
truncation distribution. Our goal is to estimate the median
of the natural utterance length distribution, so a resulting
non-normalized unimodal distribution is acceptable.
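The idea of fitting only the part of the distribution below a cut-off length can be sketched as follows. This is not the authors' procedure: it replaces the Levenberg-Marquardt fit with a simple grid search over the shape a, holds the translation and scale fixed (both simplifications made here for brevity), and uses synthetic data in which a second bump near the length limit stands in for the Twitter-induced peak. The functional form is the unnormalized scaled form x~^(a-1) e^(-x~) with x~ = (x - x0)/s.

```python
import math


def eq2(x, a, x0, s):
    """Unnormalized form: xt**(a-1) * exp(-xt), with xt = (x - x0) / s."""
    xt = (x - x0) / s
    return xt ** (a - 1) * math.exp(-xt) if xt > 0 else 0.0


def fit_shape(lengths, freqs, x0, s, x_cut, a_grid):
    """Least-squares estimate of a (and an amplitude) using only x <= x_cut."""
    pts = [(x, y) for x, y in zip(lengths, freqs) if x <= x_cut]
    best = None
    for a in a_grid:
        model = [eq2(x, a, x0, s) for x, _ in pts]
        # For a fixed a, the best amplitude is a one-parameter linear fit.
        amp = sum(m * y for m, (_, y) in zip(model, pts)) / sum(m * m for m in model)
        sse = sum((y - amp * m) ** 2 for m, (_, y) in zip(model, pts))
        if best is None or sse < best[0]:
            best = (sse, a, amp)
    return best[1], best[2]


# Synthetic bimodal data: a "natural" component plus a bump near the limit.
xs = list(range(1, 141))
ys = [eq2(x, 1.37, 0.86, 36.4) + 0.2 * math.exp(-((x - 124) ** 2) / 50) for x in xs]
a_grid = [1.0 + 0.01 * i for i in range(101)]
a_hat, amp_hat = fit_shape(xs, ys, x0=0.86, s=36.4, x_cut=80, a_grid=a_grid)
```

Because the fit ignores lengths above the cut-off, the bump near 124 characters has essentially no influence on the recovered shape, which is the point of restricting the fitting range.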

When a approaches one, Eq. (2) approaches an
exponential distribution. The range of acceptable values,
a in [1.1, 1.6] [r^2 in (0.86, 0.93)], for the TWITTER dataset
corresponds to a 57-order-of-magnitude increase in the
likelihood of finding an utterance length of x = 140
chars. compared to an exponentially decaying curve in the
absence of a Twitter-imposed limit (see Ref. [27] for the
distributions of the fitting parameters). However, another
peak was found at 124 characters due to the 140-character
limit, a limit that is absent in the other datasets, and is
attributed to various tweet-shortening schemes. The absence
of a length limit results in unimodal utterance length
distributions for PG, PGS and SUBS [Fig. 3].



Conversations in movies (IQR median = 21 chars., where
the interquartile range IQR is the difference between the
3rd and 1st quartiles) are of more uniform length than those
in books (PG IQR median = 88 chars., PGS IQR median = 50
chars.). The much smaller SUBS IQR median compared to that
of TWITTER (IQR median = 46 chars.) or that of its best fit
of Eq. (2) (IQR median = 50 chars.) suggests that
conversations in movies are less dependent on author style,
while the much larger IQR medians of PG and PGS point to a
stronger dependence of these media on author style.

To minimize the effect of unequal numbers of utterances
per author or movie, and of noise due to differences in
spelling and punctuation, Eq. (2) was fitted to PG, PGS and
SUBS by computing the normalized histogram of each author
or movie and then using the average probability for each
utterance length as the probability density function to be
fitted using least squares. Based on the fit of Eq. (2)
(a = 1.48, x_0 = 0.862, s = 34.4, r^2 = 0.984), the PGS
utterance length distribution [Fig. 3(e)] seems to be a
horizontally compressed version of the TWITTER best fit
curve (a = 1.37, x_0 = 0.86, s = 36.4) because of its
smaller s value. The PG utterance length distribution has a
fatter tail [1 - F(140) = 0.0896; Fig. 3(d)] than that of
the PGS utterance length distribution [1 - F(140) = 0.0427],
and only its tail fits Eq. (2) quite well (a = 1.24, x_0 =
2.63, s = 48.6, r^2 = 0.970). In contrast, the entire SUBS
median distribution [a = 2.71, x_0 = 0.87, s = 10.7, r^2 =
0.988; Fig. 3(f)] fits Eq. (2) and has almost no tail
[1 - F(140) = 1.19 x 10^-4]. Thus, all datasets share the
same distribution family as the Brown sentence length in
words distribution, further giving credence to the validity
of the use of characters as a unit of utterance length.

The mean length of utterance (MLU) is used to evaluate
the level of language development of a child [40, 41].
However, the mean is not a valid measure of central
tendency here because the utterance length distribution is
strongly skewed to the right. The mode of a gamma
distribution [Eq. (2)] is given by (a - 1)s + x_0, but it
does not appear to be correlated with s [Fig. 4(a)]. In
contrast, the median, though it has no closed-form
expression for a gamma distribution, appears to be more
correlated with s [Fig. 4(b)]: a larger median roughly
implies a larger spread. The median, therefore, allows us to
simultaneously describe both the location and scale of the
utterance length distribution.

Fig. 3. Utterance length distributions of (a) different authors
in PG, (b) different authors in PGS and (c) 50 randomly selected
movies in SUBS. Distribution of utterance lengths over the
entire (d) PG, (e) PGS and (f) SUBS datasets fitted with Eq. (2).

Fig. 4. Mode and median of the distribution fits. (a) Mode
and (b) median of the fit of each distribution plotted
against s.



For the rest of this paper, the median utterance length
of each subset and the median of these medians (the median
median utterance length) were used to describe each
utterance length distribution. These measures are suitable
for comparison between datasets because both are
insensitive to outliers (robust) and do not assume a
distribution (nonparametric). Any author dependence or
deviation of the data from a gamma distribution would
therefore not affect the results [34]. Tests for significant
differences were performed using the Mann-Whitney U test
[35] with continuity correction because the distributions
being compared are discrete and skewed.
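For readers unfamiliar with the test, a compact standard-library sketch of a two-sided Mann-Whitney U test with tie and continuity corrections is given below. It uses the normal approximation, which is an assumption suited to the large samples involved here; it is not the authors' implementation, and the function name is illustrative.

```python
import math
from collections import Counter


def mann_whitney_u(sample1, sample2):
    """Two-sided Mann-Whitney U test (normal approximation).

    Applies the tie correction to the variance and a 0.5 continuity
    correction. Returns (U statistic of sample1, p-value).
    """
    n1, n2 = len(sample1), len(sample2)
    n = n1 + n2
    counts = Counter(list(sample1) + list(sample2))
    # Average rank for each distinct value (ranks start at 1).
    rank_of, start = {}, 1
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = start + (t - 1) / 2
        start += t
    r1 = sum(rank_of[v] for v in sample1)
    u1 = r1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    tie_term = sum(t ** 3 - t for t in counts.values()) / (n * (n - 1))
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term)
    if var_u == 0:
        return u1, 1.0   # all observations identical
    z = max(0.0, abs(u1 - mean_u) - 0.5) / math.sqrt(var_u)
    return u1, math.erfc(z / math.sqrt(2))


# Toy medians for two hypothetical datasets.
u_stat, p_value = mann_whitney_u([24, 25, 25, 26, 25], [40, 41, 43, 39, 41])
```

Because the test works on ranks, it makes no assumption about the shape of the utterance length distribution, which is why it suits discrete and skewed data.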



5 Utterance length and sample size 

TWITTER, PGS and SUBS were subsampled (with
replacement) such that the sample sizes match each author's
sample size in PG. By taking the distribution of subsample
medians (Fig. 5), which is analogous to taking the
distribution of sample means from normally distributed data,
we found that the median median utterance length (analogous
to the mean of sample means) of SUBS (25 chars.) is very
different from that of TWITTER (38 chars.), PG (48 chars.)
and PGS (41 chars.).
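The subsampling step can be sketched with the standard library as follows. The toy utterance lengths and the list of subset sizes are placeholders chosen for illustration, and the function names are hypothetical; only the procedure (draw with replacement, one median per subsample, then the median of those medians) follows the text.

```python
import random
from statistics import median


def subsample_medians(utterance_lengths, sample_sizes, rng):
    """One median per subsample; each subsample is drawn with replacement
    and has the size of one reference subset (e.g. one PG author)."""
    return [median(rng.choices(utterance_lengths, k=k)) for k in sample_sizes]


def median_of_medians(utterance_lengths, sample_sizes, seed=0):
    """Median of the subsample medians (the 'median median')."""
    rng = random.Random(seed)
    return median(subsample_medians(utterance_lengths, sample_sizes, rng))


# Toy data: placeholder utterance lengths and three illustrative subset sizes.
toy_rng = random.Random(42)
toy_lengths = [max(1, round(toy_rng.gammavariate(2.7, 10.7))) for _ in range(20000)]
subset_sizes = [1170, 36955, 5000]
mm = median_of_medians(toy_lengths, subset_sizes)
```

Matching the subsample sizes to those of the reference dataset keeps the spread of the medians comparable across datasets, so differences in location are not an artifact of unequal sample sizes.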

Notably, the median median utterance length of SUBS of
25 chars., which is not related to the existing maximum
subtitle line length of 32-34 characters (Ofcom regulation
[43]), points to a fundamental difference in how
the verbal medium is used in movies.

Fig. 5. Distribution of median utterance lengths (median
median utterance length: dashed lines) for (a) PG, (b) SUBS, (c)
PGS and (d) TWITTER. The median utterance length data in (d)
were estimated from the natural utterance length distribution
of each TWITTER subset.

Fig. 6. Distribution of median utterance length in subsampled
TWITTER (black), PG (dark gray), PGS (light gray) and SUBS
(unfilled).

The median utterance length distributions of all datasets
are significantly different from each other (see Ref. [27] for
complete test results between each pair of datasets). Since
the PGS median distribution is significantly different from
the SUBS median distribution, conversational sentences in
books are not the same as conversational sentences in
movies, though we posit that conversations in movies are
closer to actual transcribed speech. TWITTER utterance
lengths are stochastically smaller than those of PG and PGS
but differ significantly from those of SUBS, suggesting that
Twitter is a less formal medium. We surmise that the smaller
length is due to the more spontaneous and less formal tone
of Twitter conversations compared with those in books.

To investigate the effect of sample size N on the
median utterance length, each dataset was sampled (with
replacement) into 50 groups each having N utterances.
Similar to word frequency distributions that are
dependent on N [44], the spread in, but not the location of,
the distribution of medians decreases as N increases (Fig.
6) for all datasets. At N = 10^ utterances, the median
value of SUBS collapsed to a single value of 25 characters.
At N = 10^ utterances, PG and PGS collapsed to different
single median utterance length values of 48 and 41
characters, respectively, while TWITTER falls into two unique
values of 38 and 39 characters.

The median utterance length distribution of SUBS is
very different from those of the other datasets: it can be
clearly distinguished from them even if the sample size is
only N = 100 utterances (Fig. 6). The PG and PGS median
utterance length distributions are already distinguishable
from each other at N = 100 utterances, but both overlap
with that of TWITTER. The PG, PGS and TWITTER median
utterance length distributions cease to overlap only at
N = 10^4 utterances, which gives the minimum sample size
required for meaningful comparison across communication
media as a function of time (see Ref. [34] for complete
test results).
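The dependence of the spread of medians on the sample size N can be explored with a sketch like the one below. The toy data are placeholders drawn from a gamma distribution with parameters chosen arbitrarily for illustration, and the helper name is hypothetical; only the sampling scheme (50 groups of N utterances, drawn with replacement) follows the text.

```python
import random
from statistics import median


def median_spread(utterance_lengths, n, n_groups=50, seed=0):
    """Return (max - min of group medians, median of group medians) for
    n_groups samples of size n, each drawn with replacement."""
    rng = random.Random(seed)
    medians = [median(rng.choices(utterance_lengths, k=n)) for _ in range(n_groups)]
    return max(medians) - min(medians), median(medians)


# Toy utterance lengths (arbitrary parameters, for illustration only).
toy_rng = random.Random(7)
toy = [max(1, round(toy_rng.gammavariate(1.5, 35.0))) for _ in range(50000)]
results = {n: median_spread(toy, n) for n in (100, 1000, 10000)}
```

In `results`, the first element of each pair (the spread) shrinks as N grows while the second (the location) stays roughly the same, which is the behavior described above; two datasets become distinguishable once their ranges of medians stop overlapping.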



6 Utterance length through time 

The median median utterance length in both PG [Fig. 7(a)]
(slope = -0.266 chars./yr, r^2 = 0.903, p < 10^ two-sided)
and PGS [Fig. 7(b)] (slope = -0.189 chars./yr, r^2 = 0.814,
p < 10^ two-sided) decreases with time but is not
correlated with size (PG Spearman rho^2 < 10^; PGS Spearman
rho^2 = 0.00524).
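A trend estimate of this kind can be sketched as a windowed median followed by an ordinary least-squares line. The non-overlapping 10-year bins, the helper names and the toy input are assumptions made for illustration; the text does not specify whether the windows overlap or exactly which points entered the regression.

```python
from statistics import median


def windowed_medians(year_to_medians, window=10):
    """Group per-book median utterance lengths into fixed-width year bins
    and return sorted (bin start year, median of medians) pairs."""
    bins = {}
    for year, med in year_to_medians:
        bins.setdefault(year - year % window, []).append(med)
    return sorted((start, median(vals)) for start, vals in bins.items())


def linear_fit(points):
    """Ordinary least-squares slope, intercept and r^2 for (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    syy = sum((y - my) ** 2 for _, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx, sxy ** 2 / (sxx * syy)


# Toy input: (publication year, median utterance length) for hypothetical books.
books = [(1800 + 5 * i, 70 - 0.25 * (5 * i)) for i in range(20)]
slope, intercept, r2 = linear_fit(windowed_medians(books))
```

The slope is in characters per year, the same unit as the values quoted above, so a negative slope corresponds to utterances shortening over time.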

On the other hand, the median utterance length of
SUBS [Fig. 7(c)] remains almost constant (about 27 chars.)
in time (slope = -1.897 x 10^ chars./yr, r^2 = 0.121,
p < 10^ two-sided) except for a conspicuous rise and
increased spread in the median utterance length at around
1920 that does not flatten out even if the window size is
increased from 1 year to 5 years [Fig. 7(d)]. The bump
is likely due to the availability of "talking pictures" and
commercial television starting in the late 1920s. The silent
movies prior to their release have a different "conversation
signature" from that of "talkies".

The temporal behavior of TWITTER was not studied
because the TWITTER dataset spans only a few weeks.

7 Conclusion 

Though we do not usually notice the medium-dependence
of conversations, we showed that conversations, as
measured by orthographic utterance length, are slowly
shortening in time within media but are drastically different
across media. These are fundamental differences
that are effects not just of the milieu, but of the medium
itself. Evolving technologies that lead to changes in
communication media seemingly lead us to adapt our
conversations, rather than such a technology suffering an
early demise because it cannot adapt to our natural use of
language. An extreme case in point is the short message
service (SMS) or "texting." It was originally designed with
a character limit of 160 such that most sentences would fit
in a single text message [15], and letters had to be
accessed through a numerical keypad, yet it became a popular
form of communication with its own lingo [TTj. Clearly,
adaptation occurs with a changing medium, sometimes
with unexpected side effects.
Fig. 7. Median utterance length distribution of (a) PG and (b) 
PGS with window size of 10 years, and SUBS with window size of 
(c) 1 year and (d) 5 years. Only books with at least 1,000 utter- 
ances were considered. Publication years were retrieved from 
the US Library of Congress. The window sizes were selected so 
that the plots do not change appreciably when the window size 
is varied slightly. First to third quartiles (shaded), PG median 
median utterance length (a-b, solid line), PGS median median 
utterance length (a-b, dashed line). 